Olympic Data
1 Import the data
NOC Year Decade ID First.Name Name Last.Name Sex Age
1 AFG 1960 1960s 59346 Mohammad Mohammad Asif Khokan Khokan M 24
2 AFG 1960 1960s 59043 Faiz Faiz Mohammad Khakshar Khakshar M 18
3 AFG 1960 1960s 109486 Abdul Abdul Hadi Shekaib Shekaib M 20
Height Weight BMI BMI.Category Team Population GDP GDPpC
1 171 78 26.67487 3 Afghanistan 8996973 537777800 59.77319
2 162 52 19.81405 0 Afghanistan 8996973 537777800 59.77319
3 178 68 21.46194 2 Afghanistan 8996973 537777800 59.77319
Games Season City Sport Event
1 1960 Summer Summer Roma Wrestling Wrestling Men's Middleweight, Freestyle
2 1960 Summer Summer Roma Wrestling Wrestling Men's Flyweight, Freestyle
3 1960 Summer Summer Roma Athletics Athletics Men's 100 metres
Medal Medal.No.Yes
1 No Medal 0
2 No Medal 0
3 No Medal 0
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
'data.frame': 151977 obs. of 24 variables:
$ NOC : Factor w/ 122 levels "AFG","ALB","AND",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ Decade : Factor w/ 6 levels "1960s","1970s",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ID : int 59346 59043 109486 59102 128736 29626 39922 106372 128736 58364 ...
$ First.Name : Factor w/ 14118 levels "","A","A.","Aadam",..: 8716 3731 64 599 64 11978 64 4634 64 8716 ...
$ Name : Factor w/ 74268 levels " Gabrielle Marie \"Gabby\" Adcock (White-)",..: 48941 19066 218 3341 220 64832 215 23793 220 48946 ...
$ Last.Name : Factor w/ 47370 levels "","-)","-Alard)",..: 23228 23112 38893 23137 44908 13260 16633 37860 44908 22890 ...
$ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
$ Age : int 24 18 20 35 20 28 22 23 20 20 ...
$ Height : int 171 162 178 166 179 168 172 170 179 166 ...
$ Weight : num 78 52 68 66 75 73 70 58 75 62 ...
$ BMI : num 26.7 19.8 21.5 24 23.4 ...
$ BMI.Category: Factor w/ 5 levels "0","1","2","3",..: 4 1 3 3 3 4 3 3 3 3 ...
$ Team : Factor w/ 332 levels "Acipactli","Afghanistan",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Population : int 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 ...
$ GDP : num 5.38e+08 5.38e+08 5.38e+08 5.38e+08 5.38e+08 ...
$ GDPpC : num 59.8 59.8 59.8 59.8 59.8 ...
$ Games : Factor w/ 30 levels "1960 Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Season : Factor w/ 2 levels "Summer","Winter": 1 1 1 1 1 1 1 1 1 1 ...
$ City : Factor w/ 29 levels "Albertville",..: 19 19 19 19 19 19 19 19 19 19 ...
$ Sport : Factor w/ 51 levels "Alpine Skiing",..: 51 51 3 51 3 51 3 3 3 51 ...
$ Event : Factor w/ 489 levels "Alpine Skiing Men's Combined",..: 478 468 17 476 33 482 22 24 18 466 ...
$ Medal : Factor w/ 4 levels "Bronze","Gold",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Medal.No.Yes: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Done.
2 Time Series
2.1 Number of Events
Let's look at the number of sports per year, as listed at https://www.topendsports.com/events/summer/sports/number.htm
There is clearly an upward trend, but no seasonal pattern. The data is also a little choppy at the beginning. Part of the explanation is that the data points are not evenly spaced. Most Olympic games are 4 years apart, but a few of them are just 2 years apart, and during World War I and World War II there were 8-year and 12-year gaps, respectively. Since time series data should be evenly spaced over time, we’ll only look at data from 1948 on, when the Olympics started being held every 4 years without any interruptions.
Let's see if I can build a time series using our data.
Time Series:
Start = 1948
End = 2020
Frequency = 1
Year Num.Sports
1948 1948 17
1949 1952 17
1950 1956 17
1951 1960 17
1952 1964 19
1953 1968 18
1954 1972 21
1955 1976 21
1956 1980 21
1957 1984 21
1958 1988 23
1959 1992 25
1960 1996 26
1961 2000 28
1962 2004 28
1963 2008 28
1964 2012 26
1965 2016 28
1966 2020 33
1967 1948 17
1968 1952 17
1969 1956 17
1970 1960 17
1971 1964 19
1972 1968 18
1973 1972 21
1974 1976 21
1975 1980 21
1976 1984 21
1977 1988 23
1978 1992 25
1979 1996 26
1980 2000 28
1981 2004 28
1982 2008 28
1983 2012 26
1984 2016 28
[ reached getOption("max.print") -- omitted 36 rows ]
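In the printed series above, the time index advances by one year per row while the Year column advances by four, and the rows after index 1966 repeat the data from 1948 onward. That suggests the 19 observations were recycled across an annual index. A minimal sketch of one way to avoid this, assuming the values printed above, is to give `ts()` a 4-year step through `deltat`. The names `num_sports` and `sports_ts` are illustrative and not taken from the original analysis.

```r
# Number of sports per Games, 1948-2020, copied from the printed output above.
years <- seq(1948, 2020, by = 4)
num_sports <- c(17, 17, 17, 17, 19, 18, 21, 21, 21, 21,
                23, 25, 26, 28, 28, 28, 26, 28, 33)

# deltat = 4 tells ts() that consecutive observations are 4 years apart,
# so the series ends in 2020 after exactly 19 observations (no recycling).
sports_ts <- ts(num_sports, start = 1948, deltat = 4)

print(sports_ts)
print(time(sports_ts))  # time stamps of each observation
```

With this construction the time stamps returned by `time(sports_ts)` line up with the actual Games years, so a separate Year column is no longer needed.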
2.2 Creating the models
I’m going to try 4 different models.
\[
\begin{aligned}
y_{\text{linear}}(x) &= ax + b \\
y_{\text{quadratic}}(x) &= ax^2 + bx + c \\
y_{\text{exponential}}(x) &= a\exp(bx) + c \\
y_{\text{cubic}}(x) &= ax^3 + bx^2 + cx + d
\end{aligned}
\]
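I’ll use ANOVA to compare the models in pairs: linear vs. quadratic, and exponential growth vs. the S-curve (the cubic above). Only the first pair is strictly nested, so the second comparison should be read as a rough guide.
These models all look fairly similar. Let's check using ANOVA.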
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 17 | 33.33158 | NA | NA | NA | NA |
| 16 | 30.41244 | 1 | 2.919136 | 1.535759 | 0.2331213 |
| 16 | 32.28073 | 0 | 0.000000 | NA | NA |
| 15 | 29.04971 | 1 | 3.231026 | 1.668361 | 0.2160269 |
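For reference, each F value in the table is the drop in residual sum of squares per added parameter, divided by the residual mean square of the larger model. This assumes the standard F-test for sequential fits, which the table layout suggests. Checking the second row against the numbers shown:

\[
F = \frac{(33.33158 - 30.41244)/1}{30.41244/16} = \frac{2.91914}{1.90078} \approx 1.536
\]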
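The linear model is preferred: neither the quadratic model (p = 0.233) nor the fourth model (p = 0.216) fits significantly better than the simpler model it is compared against, so nothing is gained from adding complexity.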
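As a cross-check, here is a minimal sketch that refits the polynomial models by ordinary least squares with `lm()`, which is not necessarily the fitting routine that produced the table above. It assumes the 19 values printed in the Number of Events section and compares only the strictly nested polynomial models, leaving out the exponential. The names `games_index`, `fit_*`, and `model_comparison` are illustrative.

```r
# Number of sports per Games, 1948-2020, copied from the printed series.
num_sports <- c(17, 17, 17, 17, 19, 18, 21, 21, 21, 21,
                23, 25, 26, 28, 28, 28, 26, 28, 33)

# Use the Games index (0, 1, ..., 18) instead of the raw year; this only
# rescales the predictor and does not change the fitted values.
games_index <- seq_along(num_sports) - 1

# poly() builds orthogonal polynomial terms, which avoids the numerical
# trouble that raw powers of a large number such as Year^3 can cause.
fit_linear    <- lm(num_sports ~ poly(games_index, 1))
fit_quadratic <- lm(num_sports ~ poly(games_index, 2))
fit_cubic     <- lm(num_sports ~ poly(games_index, 3))

# Sequential F-tests: linear vs quadratic, then quadratic vs cubic.
model_comparison <- anova(fit_linear, fit_quadratic, fit_cubic)
print(model_comparison)
```

If the data match, the residual sum of squares for the linear fit should agree with the first row of the table above (about 33.33). Note that `anova()` on several `lm` fits scales every F value by the residual mean square of the largest model, so the F values need not match the table exactly.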
Let's look at the top 10 sports by number of participants.
| | Sport | freq |
|---|---|---|
| 3 | Athletics | 19641 |
| 41 | Swimming | 14094 |
| 22 | Gymnastics | 13175 |
| 12 | Cross Country Skiing | 6134 |
| 1 | Alpine Skiing | 5649 |
| 14 | Cycling | 5567 |
| 31 | Rowing | 5325 |
| 34 | Shooting | 5307 |
| 17 | Fencing | 5073 |
| 11 | Canoeing | 4198 |
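A ranking like this can be sketched in base R by counting rows per sport and sorting. The example below uses a tiny made-up data frame, `athletes`, in place of the full Olympic data, and the column name `freq` simply mirrors the table above. It is not necessarily how the original counts were produced.

```r
# Toy stand-in for the Olympic data: one row per athlete entry (hypothetical).
athletes <- data.frame(
  Sport = c("Athletics", "Swimming", "Athletics",
            "Rowing", "Swimming", "Athletics")
)

# Count rows per sport; responseName sets the name of the count column.
sport_counts <- as.data.frame(table(Sport = athletes$Sport),
                              responseName = "freq")

# Sort by descending count and keep the top n sports (n = 2 for the toy data).
top_sports <- head(sport_counts[order(-sport_counts$freq), ], 2)
print(top_sports)
```

For the real data, replacing `2` with `10` would give the top ten sports.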
I need to subset the data because I keep getting the following error: “Error: vector memory exhausted (limit reached?)”. I will drop the following variables: NOC, Decade, ID, First.Name, Name, BMI, BMI.Category, Games, City, Event. I will only focus on the top ten sports.
[1] 151977
[1] 58693
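The subsetting step can be sketched as follows, assuming a small made-up data frame `olympics` with only a few of the real columns. The list of dropped variables is the one given above, while `top_sports` and `olympics_small` are illustrative names.

```r
# Toy stand-in for the full data set (hypothetical values).
olympics <- data.frame(
  NOC    = c("AFG", "USA", "USA", "FRA"),
  Year   = c(1960L, 1964L, 1964L, 1968L),
  Sport  = c("Wrestling", "Swimming", "Athletics", "Swimming"),
  Sex    = c("M", "F", "M", "F"),
  Height = c(171, 170, 185, 168),
  Weight = c(78, 61, 80, 59),
  BMI    = c(26.7, 21.1, 23.4, 20.9)
)

# Variables to drop, as listed in the text; setdiff() ignores any
# that are not present in the toy data frame.
drop_cols <- c("NOC", "Decade", "ID", "First.Name", "Name",
               "BMI", "BMI.Category", "Games", "City", "Event")

# Sports to keep (in the real analysis, the top ten).
top_sports <- c("Athletics", "Swimming")

keep_rows <- olympics$Sport %in% top_sports
olympics_small <- olympics[keep_rows, setdiff(names(olympics), drop_cols)]

# Row counts before and after subsetting.
print(nrow(olympics))
print(nrow(olympics_small))
```

Dropping unused columns and rows before modelling reduces the size of the objects R has to copy, which is one way to work around the memory error quoted above.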
Let's try making logistic (S-curve) regression models for Weight and Height.
Year Mean_Weight StdDev_Weight Mean_Height StdDev_Height Sport Sex
1 1924 64.00000 0.000000 167.0000 0.000000 Swimming F
2 1956 61.00000 4.780914 169.7333 3.634491 Swimming F
3 1960 62.73469 5.619073 169.3469 6.839076 Swimming F
4 1964 63.06000 6.466270 171.3600 4.378799 Swimming F
5 1968 62.45455 5.361348 170.3636 4.583033 Swimming F
6 1972 60.23611 5.491333 170.3889 4.949194 Swimming F
'data.frame': 339 obs. of 7 variables:
$ Year : int 1924 1956 1960 1964 1968 1972 1976 1980 1984 1988 ...
$ Mean_Weight : num 64 61 62.7 63.1 62.5 ...
$ StdDev_Weight: num 0 4.78 5.62 6.47 5.36 ...
$ Mean_Height : num 167 170 169 171 170 ...
$ StdDev_Height: num 0 3.63 6.84 4.38 4.58 ...
$ Sport : Factor w/ 10 levels "Basketball","Canoeing",..: 9 9 9 9 9 9 9 9 9 9 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
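A summary table with this shape (one row per Year, Sport, and Sex, with means and standard deviations of Weight and Height) can be sketched in base R with `aggregate()`. The data frame `swimmers` below is made up for illustration, and this is not necessarily how the original summary was computed. In particular, `sd()` returns `NA` for a group with a single athlete, whereas the output above shows 0 for such a group.

```r
# Toy stand-in for the athlete-level data (hypothetical values).
swimmers <- data.frame(
  Year   = c(1960, 1960, 1960, 1960, 1964, 1964),
  Sport  = "Swimming",
  Sex    = c("F", "F", "M", "M", "F", "F"),
  Weight = c(60, 64, 75, 79, 61, 65),
  Height = c(168, 172, 180, 184, 170, 174)
)

# Group means; the names given inside cbind() become the column names.
means <- aggregate(cbind(Mean_Weight = Weight, Mean_Height = Height) ~
                     Year + Sport + Sex, data = swimmers, FUN = mean)

# Group standard deviations, computed the same way.
sds <- aggregate(cbind(StdDev_Weight = Weight, StdDev_Height = Height) ~
                   Year + Sport + Sex, data = swimmers, FUN = sd)

# Combine both summaries into one row per Year/Sport/Sex group.
summary_stats <- merge(means, sds, by = c("Year", "Sport", "Sex"))
print(summary_stats)
```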
| Medal | mean |
|---|---|
| Bronze | 25.55859 |
| Gold | 25.28269 |
| No Medal | 24.93049 |
| Silver | 25.48383 |
# A tibble: 6 x 3
# Groups: Year [3]
Year Sex mean.Age
<int> <fct> <dbl>
1 1960 F 21.6
2 1960 M 26.0
3 1964 F 21.5
4 1964 M 25.7
5 1968 F 20.5
6 1968 M 25.1
2.3 Swimming
2.3.1 Models
2.3.1.1 Female Athletes
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 13 | 8.5085546 | NA | NA | NA | NA |
| 12 | 3.3398817 | 1 | 5.168673 | 18.57074 | 0.0010150 |
| 12 | 8.1614545 | 0 | 0.000000 | NA | NA |
| 11 | 0.6470143 | 1 | 7.514440 | 127.75427 | 0.0000002 |
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 13 | 11.697173 | NA | NA | NA | NA |
| 12 | 5.842805 | 1 | 5.854368 | 12.02375 | 0.0046521 |
| 12 | 11.347295 | 0 | 0.000000 | NA | NA |
| 11 | 2.432617 | 1 | 8.914677 | 40.31109 | 0.0000545 |
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 13 | 4.326164 | NA | NA | NA | NA |
| 12 | 4.163567 | 1 | 0.162597 | 0.4686279 | 0.5066258 |
| 12 | 4.408289 | 0 | 0.000000 | NA | NA |
| 11 | 2.732882 | 1 | 1.675407 | 6.7436063 | 0.0248333 |
2.3.1.2 Male Athletes
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 13 | 3.7207136 | NA | NA | NA | NA |
| 12 | 1.8313700 | 1 | 1.889344 | 12.37987 | 0.0042351 |
| 12 | 3.5597205 | 0 | 0.000000 | NA | NA |
| 11 | 0.4074947 | 1 | 3.152226 | 85.09186 | 0.0000016 |
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 13 | 13.008921 | NA | NA | NA | NA |
| 12 | 12.927824 | 1 | 0.0810975 | 0.0752771 | 0.7884689 |
| 12 | 12.958849 | 0 | 0.0000000 | NA | NA |
| 11 | 2.535541 | 1 | 10.4233087 | 45.2197021 | 0.0000327 |
| Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| 13 | 10.265883 | NA | NA | NA | NA |
| 12 | 4.739344 | 1 | 5.526539 | 13.99317 | 0.0028181 |
| 12 | 10.762827 | 0 | 0.000000 | NA | NA |
| 11 | 3.544380 | 1 | 7.218447 | 22.40248 | 0.0006162 |